Please choose one task below and submit solution as homework:

  1. test for the autocorrelation of the distribution of single women OR single men in Denmark during 2020, and answer the question: "Is the population of single women/men in Denmark spatially correlated? What is the correlation and how significant is the trend?. You can download the attribute data either from Denmark Statistic, or find a slightly tidied dataset at this URL.
    OR
  1. test for autocorrelation in a dataset of your choice.

I’m chosing part 2 since this will be relevant for my final project. I’m going to look at sexual assaults in Denmark, and test whether or not the reported incidents (as a ratio) are clustered around certain parts of the country in the year 2017. And whether or not the amount of cases that leads to charges are clustered. This assignment contain some stuff that is not that relevant to the assignment, but which will be usefull for further analysis in my final project.

Part 1 Get data and wrangle it

I start by getting the relevant data. I have already saved and cleaned some data from the Danish bureau of statistics.

#Population in each municipality (used for calculating ratios)
population <- read.csv("data/pop.csv", sep = ";")
#Transform it to long format
population <- gather(population, year, population, X2010:X2020)

# Amount of reported assaults
reported_assaults <- read.csv("data/reported_rape.csv", sep = ";")
#Transform it to long format
reported_assaults <- gather(reported_assaults, year, reported, X2007:X2020)

#Amount of actual charges 
charged_assults <- read.csv("data/charged_rape.csv", sep = ";")
#Transform it to long format
charged_assults <- gather(charged_assults, year, charged, X2007:X2020)

#Spatial data with municipalities
municipalities <- readRDS("data/gadm36_DNK_2_sp.rds")
#Read municipalities as a spatial object
municipalities <- st_as_sf(municipalities)
#Give municipaliteis a CRS
municipalities <- st_transform(municipalities, crs = 32632)

#Clean names
municipalities$NAME_2[31] <- "Aarhus"
municipalities$NAME_2[21] <- "Høje-Taastrup"
municipalities$NAME_2[60] <- "Vesthimmerlands"

Now I need to merge my data frames and do some simple cleaning.

#Merge dataframes
assaults <- left_join(reported_assaults, charged_assults, by= c("year", "kommune"))
assaults <- left_join(assaults, population, by= c("year", "kommune"))
#Remove rows with NA (this basically means all years from before 2007 and all regions that aren't municipalities)
#This is done since I only have population data from 2010
assaults <- assaults %>% na.omit()

#Merge data with municipalities
assaults <- merge(assaults, municipalities, by.x = "kommune", by.y = "NAME_2")

#Remove the X that is in front of years
assaults <- assaults %>%  mutate(year = str_remove(year, "X"))

#Convert string to int in year
assaults <- assaults %>%  mutate(year = as.numeric(year))

Now I can calculate ratios which will be 1:10.000, and calculate the gap between reported cases of sexual assault and charges of sexual assault.

#Calculate ratios
assaults <-  
  assaults %>% 
  mutate(reported_ratio = (reported/population)*10000, #How many sexual assault cases are reported out of 10.000 people
         charged_ration = (charged/population)*10000,  #How many sexual assault cases are charged out of a population of 10.000 people
          gap_pct = (charged / reported)*100 ) #How many percent of the reported cases lead to charges?

Latsly, I need to convert my dataframe to a spatial object with a CRS.

#Make assaults a spatial object
assaults <- st_as_sf(assaults)
# Give it the correct crs
assaults <- st_transform(assaults, crs = 32632)

Part 3: Plot data

Now I can plot some of the data. First I will see how many assaults cases (out of every 10.000 people) there was in 2017.

assaults_2017 <- 
  assaults %>% 
  filter(year == 2017)

#Plot rapes in 2017 as a 
tmap_mode(mode = "view")
## tmap mode set to interactive viewing
tm_shape(assaults_2017) +
 tm_polygons("reported_ratio", style = "pretty",
             id = "kommune",
             title = "Sexual assault charges pr. 10.000 inhabitants (2017)")

Interstingly enough it appears that there are some huge variations with some places having 0 reported cases and other places having 9 cases out of every 10.000 residents. From a purely visual perspective it do look a little bit clustered around southern and northen Jutland and southern and northern sealand. But this might just be random.

Now let’s look at the gap between reported cases and actual charges

#As a percentage of cases that results in a charge
tmap_mode(mode = "view")
## tmap mode set to interactive viewing
tm_shape(assaults_2017) +
 tm_polygons("gap_pct",
             style = "pretty",
             id = "kommune",
             palette = "BuGn",
             title = "percentage of cases that results in a charge in 2017")

Here we see some stark contrasts, in some municipalities, every report lead to charges while in others, no case leads to a charge. Some of the extreme values are probably a result of low values (if we have 0 charges and 0 reports in a municipality, then it mathematically looks like, every single case have led to a charge, even though this notion is absurd). But I won’t deal with that now. Instead I’m going to use the auto correlation test the maps I created.

Part 4: Auto correlation test

#Simplify spatial object
assaults_2017 <- st_cast(st_simplify(assaults_2017, dTolerance = 250), to = "MULTIPOLYGON")

#Find neighbours based on queen movement
nb <- poly2nb(assaults_2017$geometry)

#Get center of each municipality
centers <- st_coordinates(st_centroid(assaults_2017$geometry))
# Run a Moran on reported cases (as a ratio)
moran.test(assaults_2017$reported_ratio, nb2listw(nb, style = "W", zero.policy = TRUE), zero.policy = TRUE)
## 
##  Moran I test under randomisation
## 
## data:  assaults_2017$reported_ratio  
## weights: nb2listw(nb, style = "W", zero.policy = TRUE)  n reduced by no-neighbour observations
##   
## 
## Moran I statistic standard deviate = 0.17506, p-value = 0.4305
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.001217848      -0.011111111       0.004960198

It looks like, the results are probably random when it comes to reported cases. This suggest that there isn’t any clustering. But let’s run a monte carlo simulation to be sure.

#Run a monte carlo simulation on reported cases (as a ratio)
moran.mc(assaults_2017$reported_ratio, nb2listw(nb, zero.policy = TRUE), zero.policy = TRUE, 
    nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  assaults_2017$reported_ratio 
## weights: nb2listw(nb, zero.policy = TRUE)  
## number of simulations + 1: 1000 
## 
## statistic = 0.0012178, observed rank = 565, p-value = 0.435
## alternative hypothesis: greater

With a p-value of 0.4, it most definitely is random in the year 2017.

What about, when looking at the percentage of reported cases that leads to charges?

# Run a Moran on the percentage of cases that lead to charges, I'm removing NA, since they might interfere
moran.test(na.omit(assaults_2017$gap_pct), nb2listw(nb, style = "W", zero.policy = TRUE), zero.policy = TRUE)
## 
##  Moran I test under randomisation
## 
## data:  na.omit(assaults_2017$gap_pct)  
## weights: nb2listw(nb, style = "W", zero.policy = TRUE) 
## omitted: 9, 10, 13, 53, 56, 68, 79, 89 n reduced by no-neighbour observations
##   
## 
## Moran I statistic standard deviate = 2.7022, p-value = 0.003444
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.209917884      -0.011764706       0.006729956

Interestingly enough, there is a P-value of 0.003, which suggest that this data is clustered around neighbors. What does the monte carlo simulation say?

#Run a monte carlo simulation on reported cases (as a ratio)
moran.mc(na.omit(assaults_2017$gap_pct), nb2listw(nb, zero.policy = TRUE), zero.policy = TRUE, 
    nsim = 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  na.omit(assaults_2017$gap_pct) 
## weights: nb2listw(nb, zero.policy = TRUE) 
## omitted: 9, 10, 13, 53, 56, 68, 79, 89 
## number of simulations + 1: 1000 
## 
## statistic = 0.2026, observed rank = 997, p-value = 0.003
## alternative hypothesis: greater

The p-value is all the way down to 0.003, this suggest, that theres is a spatial clustering in the municipalities, when it comes to the difference between reported cases of sexual assault and the amount that leads to charges. Maybe some municipalities have more focus/less focus on the issue. But again this is so preliminary an analysis, that I can’t make any general claims yet.